Summary of my Work

Load and overview the dataset

Understand the shape of the dataset

Check the data types of the columns in the dataset.

Summary of the data
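A minimal pandas sketch of the overview steps above. The small frame here is a made-up stand-in; the notebook loads the actual AllLife Bank CSV instead.

```python
import pandas as pd

# Made-up stand-in rows; the notebook reads the actual AllLife Bank CSV instead.
df = pd.DataFrame({
    "Age": [45, 39, 52, 61],
    "Income": [49, 110, 73, 98],     # thousands of dollars
    "CCAvg": [1.6, 2.5, 0.9, 8.8],   # avg monthly credit-card spend (thousands)
    "Personal_Loan": [0, 1, 0, 1],
})

print(df.shape)             # (rows, columns)
print(df.dtypes)            # data type of each column
print(df.describe().T)      # count, mean, std, quartiles per column
print(df.isnull().sum())    # missing values per column
```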

Observations

  1. The mean of the Age column is approximately 45 and the median is also 45, so customer ages are centred around 45. The maximum age is 67, and about 75% of customers are below 55 years old.
  2. The mean CCAvg is approximately 1.93K, but the values span a wide range, from 0 to 10K. We will explore this further in univariate analysis.
  3. The mean income is around 73K, and about 75% of customers earn less than 98K. The highest income is around 224K, which is not abnormal in the USA. About 25% earn more than 100K.
  4. Mortgage also has a wide range of values. One possible explanation is that customers with higher incomes may have taken larger mortgage loans; this needs more exploration.
  5. The Experience column contains negative values; we will explore further and may strip the '-' sign and keep the positive values.
  6. ZipCode: it is not clear whether these all belong to the same country or state. We will explore this further.
  7. About 50% of customers hold a graduate degree, which fits with the mean income being above 70K. This is a good sign that customers may accept personal loans.

There are no missing values in the dataset.

About 52 rows in the Experience column carry a '-' sign; I will treat them at a later stage.

  1. With AllLife Bank, about 500 customers have a securities account, 480 have a personal loan with the bank, and 300 have a CD account. About 30% of all customers hold a credit card with other banks.
  2. About 40 percent of the customers have undergraduate-level schooling.
  3. Family sizes are almost evenly distributed, led by singles (around 30%).
  4. About 75% of customers do not have a mortgage, a healthy sign for offering a personal loan, though this could vary depending on income, age, experience, and other factors.
  5. Many of the above columns can be treated as categorical: Family, Experience, Securities_Account, CD_Account, Online, etc.
  6. Other columns such as Age, Income, Mortgage, and even CCAvg can be grouped and used as categorical columns for better and easier analysis. This helps to identify the different segments easily.
  7. About 60% of customers have online access to their bank accounts.

Since ID is a unique identifier, it is safe to drop the column.

EDA

Univariate Analysis

Observations on Age

  1. The distribution of Age is right-skewed, though there appear to be a few outliers.
  2. I would not treat these outliers, as they represent the real market trend.

Observations on Income

  1. The distribution of the Income is right-skewed
  2. The boxplot shows that there are outliers at the right end
  3. I would not treat these outliers as they represent the real market trend. These are good candidates to offer loans.

Observations on Mortgage

  1. Majority of the customers do not have mortgage.
  2. The boxplot shows that there are outliers at the right end
  3. It may be better to treat (drop) them. Since personal loans are unsecured, it is easier for these customers to default on the loan if they run into financial trouble, such as losing a job.

Observations on Experience

Observations on CCAvg

  1. The distribution of the CCAvg is right-skewed
  2. The boxplot shows that there are outliers at the right end
  3. We will not treat these outliers as they represent the real market trend

Observations on Personal Loan

  1. About 90% of customers do not have a loan with the bank, compared to the 9% who accepted in the last promotion.
  2. This is a huge opportunity; we will explore further to see which variables influence the decision.

Observations on Securities_Account

About 10% of customers have a securities account with the bank.

Observations on CD_Account

About 6% of customers have a CD account.

Observations on CreditCard

About 30% of customers hold a credit card with other banks, and some of them spend about 10K on average. The bank should consider age and income before offering personal loans to these customers. Alternatively, offering an interest discount to transfer these credit-card debts to personal loans could be a great idea.

Observations on Education

Observations on Family

  1. 29% are single, compared to 70% with families (24% have 4 members).
  2. It is better to target families of 3 and 4 members (around 45%), subject to their other qualifications; they are more reliable in repayment.

Bivariate Analysis

  1. There are overlaps, i.e., no clear distinction in the distributions of the variables.
  2. I would like to explore this further with the help of other plots.

Personal_Loan vs Age

Personal_Loan vs CCAvg

  1. Customers who spend more on credit cards tend to take personal loans.
  2. The graph above shows a potential opportunity with customers who spend between 0 and 3K.
  3. There are a few outliers, but they represent the market trend: people with stronger buying habits also tend to take personal loans.

Personal_Loan vs Income

  1. Customers with higher incomes took personal loans in the first round of the campaign.
  2. The graph above shows a potential opportunity with customers whose income is above 60K.
  3. There are a few outliers, but they represent the market trend: people with higher incomes tend to take personal loans.

Personal_Loan vs Experience

  1. Customer experience had no significant impact on taking or offering loans in the first-round campaign.
  2. The graph shows that, as long as customers have a good income and other qualifiers, it is better to encourage customers of all ages.

Personal_Loan vs Mortgage

  1. There is a huge opportunity, since the majority of customers have no mortgage at all; the first-round campaign proved this too.
  2. There are a few outliers, but it may be fine to leave them because, as mentioned earlier, customers who tend to spend more also tend to opt in to loans.

Data Cleaning & Preprocessing (Feature Engineering)

I noticed that Income, Mortgage, CCAvg, Age, Family, and Experience have a wide variety of values. From the univariate and bivariate analysis, all of these variables except Experience have a strong influence on whether a customer decides to take a personal loan. The key question for this assignment is which variables are most significant. To find out, I want to create bins and use them as categories. To make the analysis more realistic, I will categorise each into 2-3 groups.
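As a sketch of this binning idea, `pd.cut` turns a numeric column into a few labelled groups. The cut points and labels below are illustrative, not the notebook's exact edges.

```python
import pandas as pd

# Illustrative incomes (in thousands) spanning the observed 8K-224K range.
income = pd.Series([8, 45, 73, 98, 224])

# Hypothetical bin edges and labels; the notebook's actual cut points may differ.
income_group = pd.cut(income, bins=[0, 50, 100, 250],
                      labels=["low", "mid", "high"])
print(income_group.tolist())   # ['low', 'low', 'mid', 'mid', 'high']

# Mortgage collapses naturally into two buckets: has one or not.
mortgage = pd.Series([0, 0, 120, 0, 310])
has_mortgage = (mortgage > 0).astype(int)
print(has_mortgage.tolist())   # [0, 0, 1, 0, 1]
```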

Dealing with Income

Note that none of the customers has zero income; the minimum income is around 8K.

Dealing with the Mortgage column

Splitting customers into two buckets, because most customers have no mortgage at all. We also noticed that the majority of the customers who took personal loans do not have a mortgage.

Dealing with CCAvg

This tells me that the majority of customers have a CCAvg at or below 3K. The bank should offer these customers an incentive to convert this credit-card spending into personal loans. This is a huge opportunity for the bank to raise its revenue.

Dealing with Age

Dealing with Experience Column

A few of the Experience rows have a '-' sign in front of the value. This could be a typo, since the column allows signed values. Rather than removing these rows, I want to strip the sign and keep them.
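A minimal pandas sketch of that treatment, assuming the negative sign is just a typo. The sample values here are made up; the notebook applies this to its Experience column.

```python
import pandas as pd

# Made-up values mimicking the typo'd rows; the notebook uses df["Experience"].
experience = pd.Series([3, -1, 12, -2, 0])
experience = experience.abs()   # strip the sign, keep every row
print(experience.tolist())      # [3, 1, 12, 2, 0]
```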

Looking into the ZIPCode column

There are about 467 unique zip codes. Let me dig deeper to get more insight into this column.

All these zip codes belong to one state, CA. I would not worry much, since there are many unique values and they all belong to the same state. I am inclined to drop this column.

Bivariate Analysis after Data Cleaning

There is no significant correlation between the columns. However, Income, CCAvg, and CD_Account have a moderate influence on Personal_Loan.

Customers in the more-than-100K income group tend to take personal loans, compared to groups 2 and 1.

It is better to target customers in income groups 3 and 2.

Mortgage & Personal_Loan

The graph and the past campaign clearly show that we should target customers who have zero mortgage.

Summary of EDA

Data Description:

  1. Including the dependent variable Personal_Loan, all columns are of int type except CCAvg, which is a float.
  2. Although all variables are int type, many of them can be turned into categorical columns, making it easier to identify customer segments.
  3. There are no missing values in the dataset.

Data Cleaning: (Feature Engineering)

I realized that some rows in the Experience column have a negative sign in front. Considering it a typo, I removed the sign. I also noted that customers with higher incomes who spend more on credit cards tend to take personal loans regardless of age and experience, and that larger families tend to take more loans. Since Income, Age, CCAvg, and Mortgage have a wide variety of values, I put them into bins. ZIPCode has about 467 unique values, so I decided to drop that column and experiment with the rest.

Observations from EDA:

  1. Age: the average age is around 45; about 75% of customers are below 55.
  2. Income: the mean is around 73K; about 75% earn below 98K.
  3. CCAvg: the median is around 1.5K; about 900 customers spend around 3K.
  4. Mortgage: about 50% have no mortgage at all, but a few customers have loans of more than 300K.
  5. Education: more highly educated customers tend to take personal loans, which makes sense since they have higher incomes.
  6. Family: families of 3-4 members tend to take more loans; this makes sense, and these groups are very loyal.
  7. ZipCode: all values are from the same country and state, with 467 unique values; it is hard to tell which ones have more influence without a full analysis of this column.

Actions for data pre-processing:

  1. Most of the outliers are gone from Mortgage, Income, and CCAvg because I put them into categories. Please refer to the post-cleaning bivariate analysis.
  2. I'm going to drop ZipCode column.

Data Pre-Processing

  1. Dropping ZIPCode column
  2. Checking outliers in the numerical columns.

As stated above, most of the columns have been placed in bins, so outliers are no longer shown.

Logistic Regression Analysis

Data Preparation

Building the model

Model evaluation criteria

The model can make wrong predictions in two ways: predicting that the bank can issue a loan to a customer who may not qualify, and predicting that the bank should not issue a loan to a customer who may qualify.

Which case is more important?

How to reduce this loss (i.e., we need to reduce false negatives)?

Logistic Regression

Finding the coefficients

Coefficient interpretations

Converting coefficients to odds

The coefficients of the logistic regression model are in terms of log(odds); to find the odds we take the exponential of the coefficients. Therefore, odds = exp(b). The percentage change in odds is given by (exp(b) - 1) * 100.
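This conversion can be sketched as follows. The coefficient values here are made up for illustration; the notebook's fitted values will differ.

```python
import numpy as np

# Hypothetical coefficients; the notebook's fitted values will differ.
coefs = {"Income": 0.05, "CCAvg": 0.12, "Family": -0.30}

for name, b in coefs.items():
    odds = np.exp(b)                  # odds ratio per one-unit increase
    pct_change = (odds - 1) * 100     # percentage change in odds
    print(f"{name}: odds = {odds:.3f}, change = {pct_change:+.1f}%")
```

A positive coefficient gives odds above 1 (a positive percentage change), a negative coefficient gives odds below 1.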

Odds from coefficients

Coefficient interpretations

Checking model performance on training set

Checking performance on test set

ROC-AUC on Training Set

ROC-AUC on test set

The model is giving a generalized performance (similar results on the training and test sets).

Model Performance Improvement

I would like to see whether the recall score can be improved further by changing the model threshold using the AUC-ROC curve.

Optimal threshold using AUC-ROC curve

Checking model performance on training set

Checking model performance on test set

The precision of the model on both the training and test sets has improved, but the F1 score has decreased.

Let's use Precision-Recall curve and see if we can find a better threshold
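A sketch of that search, again on toy scores (the notebook applies it to the model's predicted probabilities): pick the threshold where precision and recall are closest to each other.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Toy labels and scores; the notebook uses the fitted model's probabilities.
y_true = np.array([0, 0, 1, 1, 0, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
# The final precision/recall pair has no threshold, hence the [:-1] slices.
balanced = thresholds[np.argmin(np.abs(precision[:-1] - recall[:-1]))]
print(balanced)
```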

At the threshold of 0.38, we get balanced recall and precision.

Checking model performance on training set

Checking model performance on test set

Recall has not improved as compared to the initial model.

The model with a threshold of 0.38 did not give better recall.

Model Performance Summary

Conclusion

Recommendations

Build Decision Tree Model

Data Preparation

The model can make wrong predictions as:

Which case is more important?

How to reduce this loss (i.e., we need to reduce false negatives)?

Creating common functions to calculate the different metrics and the confusion matrix.
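A sketch of such a helper; the function name and returned keys are my own, not necessarily the notebook's.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

def model_metrics(model, X, y, threshold=0.5):
    """Score a fitted classifier at a given probability threshold."""
    prob = model.predict_proba(X)[:, 1]
    pred = (prob >= threshold).astype(int)
    return {
        "accuracy": accuracy_score(y, pred),
        "recall": recall_score(y, pred),       # the metric we care about most
        "precision": precision_score(y, pred),
        "f1": f1_score(y, pred),
        "confusion": confusion_matrix(y, pred),
    }

# Tiny demonstration on separable toy data.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
scores = model_metrics(LogisticRegression().fit(X, y), X, y)
print(scores["recall"], scores["accuracy"])
```

The same call works for the logistic regression and the tree models, and the `threshold` argument supports the threshold tuning above.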

A few points about building the tree model

Checking model performance on test set

Visualizing the Decision Tree

Reducing overfitting

Using GridSearch for Hyperparameter tuning of our tree model
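The tuning step can be sketched like this, on toy data and with an illustrative grid; the notebook runs it on the bank's training split with its own parameter grid.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Toy data standing in for the training split.
X, y = make_classification(n_samples=200, random_state=1)

# Illustrative grid; the notebook's values may differ.
param_grid = {"max_depth": [3, 5, 7], "min_samples_leaf": [1, 5, 10]}
search = GridSearchCV(DecisionTreeClassifier(random_state=1), param_grid,
                      scoring="recall", cv=5)  # recall: minimise false negatives
search.fit(X, y)
print(search.best_params_)
```

Scoring on recall matches the evaluation criterion chosen earlier: a missed qualifying customer (false negative) is the costly error.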

Checking performance on training set

Visualizing the Decision Tree

Observations from the tree:

Cost Complexity Pruning

The DecisionTreeClassifier provides parameters such as min_samples_leaf and max_depth to prevent a tree from overfitting. Cost complexity pruning provides another option to control the size of a tree. In DecisionTreeClassifier, this pruning technique is parameterized by the cost complexity parameter, ccp_alpha. Greater values of ccp_alpha increase the number of nodes pruned. Here we only show the effect of ccp_alpha on regularizing the trees and how to choose a ccp_alpha based on validation scores.

Total impurity of leaves vs effective alphas of pruned tree

Minimal cost complexity pruning recursively finds the node with the "weakest link". The weakest link is characterized by an effective alpha, where the nodes with the smallest effective alpha are pruned first. To get an idea of what values of ccp_alpha could be appropriate, scikit-learn provides DecisionTreeClassifier.cost_complexity_pruning_path, which returns the effective alphas and the corresponding total leaf impurities at each step of the pruning process. As alpha increases, more of the tree is pruned, which increases the total impurity of its leaves.

Training decision trees using the effective alphas. The last value in ccp_alphas is the alpha value that prunes the whole tree, leaving the tree, clfs[-1], with one node.

Remove the last element in clfs and ccp_alphas, because it is the trivial tree with only one node. The number of nodes and the tree depth decrease as alpha increases.
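The steps above can be sketched as follows, on toy data; in the notebook the pruning path comes from the bank's training split.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Toy data standing in for the training split.
X, y = make_classification(n_samples=200, random_state=1)

# Effective alphas and total leaf impurities along the pruning path.
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(X, y)
ccp_alphas, impurities = path.ccp_alphas, path.impurities

# One tree per alpha; the last alpha prunes everything down to a single node.
clfs = [DecisionTreeClassifier(random_state=1, ccp_alpha=a).fit(X, y)
        for a in ccp_alphas]
clfs, ccp_alphas = clfs[:-1], ccp_alphas[:-1]  # drop the trivial one-node tree
print(len(clfs), clfs[-1].tree_.node_count)
```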

The maximum recall occurs at alpha = 0.030; rather than that, I would like to choose alpha = 0.005, which helps retain more information while still achieving a high recall.

Checking performance on training set

Checking performance on test set

Visualizing the Decision Tree

Creating model with 0.005 ccp_alpha

Checking performance on the training set

Checking performance on the test set

Visualizing the Decision Tree

Comparing all the decision tree models

Conclusions

Comparing Logistic Regression and Decision Tree models:

Recommendations